Broadcast analysis

First of all, import the datasets I obtained from ORFEO:

# Libraries used throughout the analysis
library(dplyr)      # data wrangling / pipes
library(ggplot2)    # plotting
library(patchwork)  # combining plots with |
library(plotly)     # interactive 3D scatterplots

bcast_default.df    <- read.csv(file = "bcast_default.csv")
bcast_linear.df     <- read.csv(file = "bcast_linear.csv")
bcast_chain.df      <- read.csv(file = "bcast_chain.csv")
bcast_binarytree.df <- read.csv(file = "bcast_binarytree.csv")

Now we can bind them into a single df:

bcast.df <- rbind(bcast_default.df, bcast_linear.df, bcast_chain.df, bcast_binarytree.df)

Fix the algorithm and see how latency changes as the allocation varies

Bcast linear

For now, let's fix the message size to 2 MPI_CHAR:

bcast_linear.df %>% filter(MessageSize == 2) %>%
  ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
    geom_line(linetype = 1) +
    geom_point(size=1)+
    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
    scale_color_manual(values = c("#27AE60","#E67E22","#3498DB")) +
    xlim(2, 48) +  # Adjust the x-axis limits as needed
    labs(x = 'N. of processes (#cores)',
         y = 'Latency',
         title = 'Linear Broadcast Latency vs #cores (MessageSize = 2)',
         color = 'Allocation')+
    theme_light()

Comments:

  • As expected, both allocation by socket and by core reveal some “jumps” when crossing a node boundary, while allocation by node proceeds “linearly”
  • Regarding the jump when crossing a socket boundary, allocation by core shows a jump, while allocation by socket proceeds linearly
  • Something strange (which I honestly cannot explain) is the position of the jumps… why at 30 and not at 24? Isac suggested something about unusual settings of external variables; if we were on EPYC nodes we could suspect NUMA effects, but THIN nodes have a single NUMA region, so that is not the case here.
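As a hedged aside, the jump positions discussed above can also be located programmatically, by scanning for large consecutive latency increases. The toy data and the 1 µs threshold below are invented purely for illustration; on the real measurements one would run this per (Algorithm, Allocation, MessageSize) group.

```r
# Toy stand-in for one slice of the measurements; values are invented so
# that "jumps" occur when moving to 6 and 9 processes.
toy <- data.frame(
  Processes = 2:10,
  Latency   = c(1.0, 1.1, 1.2, 1.3, 4.0, 4.1, 4.2, 9.0, 9.1)
)

increments <- diff(toy$Latency)                  # latency increase per added process
jump_at    <- toy$Processes[-1][increments > 1]  # hypothetical 1 µs jump threshold
jump_at  # -> 6 9
```

The same scan applied to the real data would tell us exactly where the unexplained jump at 30 processes sits for each configuration.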

Bcast chain

As we did above for the linear broadcast, we fix the message size to 2 MPI_CHAR:

bcast_chain.df %>% filter(MessageSize == 2) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
  geom_line(linetype = 1) +
  geom_point(size=1)+
  geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
  annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
  scale_color_manual(values = c("#27AE60","#E67E22","#3498DB")) +
  xlim(2, 48) +  # Adjust the x-axis limits as needed
  labs(x = 'N. of processes (#cores)',
       y = 'Latency',
       title = 'Chain Broadcast Latency vs #cores (MessageSize = 2)',
       color = 'Allocation')+
  theme_light()

Comments: same considerations as for the linear broadcast above.

Bcast BinaryTree

bcast_binarytree.df %>% filter(MessageSize == 2) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
  geom_line(linetype = 1) +
  geom_point(size=1)+
  geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
  annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
  scale_color_manual(values = c("#27AE60","#E67E22","#3498DB")) +
  xlim(2, 48) +  # Adjust the x-axis limits as needed
  labs(x = 'N. of processes (#cores)',
       y = 'Latency',
       title = 'BinaryTree Broadcast Latency vs #cores (MessageSize = 2)',
       color = 'Allocation')+
  theme_light()

Fix map allocation and compare different algorithms

Comparing Linear vs Chain

For now, let's focus on the comparison between Linear and Chain:

bcast.df %>% filter(MessageSize == 2)  %>% filter(Allocation == "core") %>% filter(Algorithm %in% c("Linear","Chain"))  %>%
  ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
    geom_line(linetype = 1) +
    geom_point(size=1)+
    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
    scale_color_manual(values = c("#E67E22","#3498DB")) +
    xlim(2, 48) +  # Adjust the x-axis limits as needed
    labs(x = 'N. of processes (#cores)',
         y = 'Latency',
         title = 'Broadcast Latency vs #cores (MessageSize = 2)',
         color = 'Algorithm')+
    theme_light()

Comments:

Remember that we are dealing with --map-by core

  • Region 1 (within socket): chain and linear perform roughly the same -> this suggests that intra-socket communication (between cores in the same socket) is so fast that the choice of algorithm does not impact latency significantly

    NB: here it might be interesting to show point-to-point latency (since we are working with a short message, we are measuring the “pure” latency). From that observation we can then move to the plots where I increase the message size to show the difference in algorithm performance

  • Region 2 (within node): chain and linear still perform roughly the same -> perhaps intra-node communication (between cores, even across sockets) is also fast enough that the difference is not very evident. Still, there is a small gap showing chain with slightly lower latency. Reasoning about the implementation of chain, this makes sense: chain works with contiguous cores (while flat involves more “core-jumps”), and the paper states that MPI chain sends the message split into chunks (which flat cannot do)

  • Region 3 (outside node): in this region the latency is obviously significantly higher (even if I cannot explain the position of the jump), and the difference between the chain and linear algorithms is more evident. As explained before, this makes sense: chain works with contiguous cores, while the flat tree sends from rank 0 to all other ranks!

BONUS: in the plot above the difference between the algorithms was not very evident, but this may simply be due to the small message size. Even if the flat tree is less efficient than chain, sending a small message (2 MPI_CHAR) makes both algorithms extremely fast.

Recall that the choice of starting with a small message size was to exclude as much as possible the time needed for sending/storing the message, and to focus as much as possible on the “pure” latency.
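This “pure latency” idea can be made concrete with a simple α + β·m point-to-point cost sketch (Hockney-style; the α and β values below are purely hypothetical, chosen only to illustrate the reasoning):

```r
alpha <- 1.0    # startup latency in µs (hypothetical value)
beta  <- 0.001  # per-byte transfer time in µs/byte (hypothetical value)

cost <- function(m) alpha + beta * m  # cost of sending an m-byte message

cost(2)      # 2 MPI_CHAR: alpha dominates -> essentially "pure" latency
cost(16384)  # large message: the beta * m term dominates
```

With a 2-byte message the transfer term contributes only about 0.2% of the total here, which is why the small-message runs isolate the startup latency.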

BUT, if we now increase the message size, the different performance of the two algorithms becomes more evident:

Plots <- list()
sizes <- c(64, 256, 2048)
for(i in seq_along(sizes)){
  # Filter by current size
  current_plot<- bcast.df %>% filter(Allocation == "core") %>% filter(Algorithm %in% c("Linear","Chain"))  %>% filter(MessageSize == sizes[i]) %>%
                  ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
                          geom_line(linetype = 1) +
                          geom_point(size=1)+
                          geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
                          annotate("text",x=c(12,24),y=c(0,0),label=c("12 Cores","24 cores"),vjust=0,hjust=-0.1,color="#616A6B")+
                          scale_color_manual(values = c("#E67E22","#3498DB")) +
                          xlim(2,48) +  # Adjust the x-axis limits as needed
                          labs(x = 'N. of processes (#cores)',
                               y = expression('Latency (µs)'),
                               title = paste0("Broadcast Latency vs #cores (MessageSize =",sizes[i],")"),
                               color = 'Algorithm')+
                          theme_light()
  Plots[[i]]<- current_plot
}

Plots[[1]] | Plots[[2]] | Plots[[3]]

Comparing all bcast algorithms

Having noticed that allocation by core is associated with the lowest latency, let's fix Allocation to core and compare the algorithms behind the MPI broadcast operation.

Considering a message size = 2 MPI_CHAR we obtain:

bcast.df %>% filter(MessageSize == 2)  %>% filter(Allocation == "core") %>%
  ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
    geom_line(linetype = 1) +
    geom_point(size=1)+
    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
    scale_color_manual(values = c("#27AE60","#E67E22","#3498DB","#8E44AD")) +
    xlim(2, 48) +  # Adjust the x-axis limits as needed
    labs(x = 'N. of processes (#cores)',
         y = 'Latency',
         title = 'Broadcast Latency vs #cores (MessageSize = 2)',
         color = 'Algorithm')+
    theme_light()

Instead, if we let the message size vary, we get something similar to what we saw for Linear vs Chain:

Plots <- list()
sizes <- c(64, 256, 2048)
for(i in seq_along(sizes)){
  # Filter by current size
  current_plot<- bcast.df %>% filter(Allocation == "core") %>% filter(MessageSize == sizes[i]) %>%
                    ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
                    geom_line(linetype = 1) +
                    geom_point(size=1)+
                    geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
                    annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
                    scale_color_manual(values = c("#27AE60","#E67E22","#3498DB","#8E44AD")) +
                    xlim(2, 48) +  # Adjust the x-axis limits as needed
                    labs(x = 'N. of processes (#cores)',
                         y = expression('Latency (µs)'),
                         title = paste0("Broadcast Latency vs #cores (MessageSize =",sizes[i],")"),
                         color = 'Algorithm')+
                    theme_light()
  Plots[[i]]<- current_plot
}

Plots[[1]] | Plots[[2]] | Plots[[3]]

Fix algorithm and allocation and see how latency changes when message size varies

Linear

bcast_linear.df %>%
  filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,1024,2048,16384)) %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line(linetype=1) +
  geom_point() +
  labs(
    x = "Processes",
    y = "Latency",
    title = "Latency vs Processes for Linear algorithm (Allocation='core')",
    color = "Message Size"
  ) +
  xlim(c(2,48))+
  theme_light()

Note: I did NOT consider all the message sizes, since the latency grows exponentially-like with respect to message size. We will treat this more carefully when building a model…

Chain

bcast_chain.df %>%
  filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,2048,4096,8192)) %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line(linetype=1) +
  geom_point(size=1) +
  labs(
    x = "Processes",
    y = "Latency",
    title = "Latency vs Processes for Chain algorithm (Allocation='core')",
    color = "Message Size"
  ) +
  xlim(c(2,48))+
  theme_light()

Binary tree

bcast_binarytree.df %>%
  filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,2048,4096,8192)) %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line(linetype=1) +
  geom_point(size=1) +
  labs(
    x = "Processes",
    y = "Latency",
    title = "Latency vs Processes for BinaryTree algorithm (Allocation='core')",
    color = "Message Size"
  ) +
  xlim(c(2,48))+
  theme_light()

Performance Models

Now we fix the allocation to core, as it was the most performant allocation for all the algorithms. Our aim is to fit a performance model able to predict latency based on the data.

Disclaimer: it might be a bias from my statistical background, but I was not able to implement the latency models explained in the paper…

Model for linear alg

First of all, let’s plot the variables to spot some relations:

bcast_linear.df %>% 
  filter(Allocation == "core") %>%
  plot_ly(
    x = ~MessageSize,
    y = ~Processes,
    z = ~Latency
  ) %>%
  add_markers(
    marker = list(color = ~Latency,
                  line = list(color = "black", width = 0.8))
  ) %>%
  layout(
    # 3D axis titles must go inside `scene` for plotly
    scene = list(xaxis = list(title = "MessageSize"),
                 yaxis = list(title = "Processes"),
                 zaxis = list(title = "Latency")),
    title = "Latency vs MessageSize and Processes for Linear algorithm (Allocation='core')"
  )

We can fit a simple linear model (it makes sense to remove the intercept):

fit.linear<- lm(formula= Latency ~ -1+MessageSize+Processes,
                data=bcast_linear.df %>% filter(Allocation == "core")
                )
summary(fit.linear)
## 
## Call:
## lm(formula = Latency ~ -1 + MessageSize + Processes, data = bcast_linear.df %>% 
##     filter(Allocation == "core"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1133.70   -32.02   -20.19    -9.89   650.40 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## MessageSize 1.136e-03  2.454e-05  46.294  < 2e-16 ***
## Processes   1.083e+00  2.269e-01   4.775 2.36e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 137.4 on 502 degrees of freedom
## Multiple R-squared:  0.838,  Adjusted R-squared:  0.8373 
## F-statistic:  1298 on 2 and 502 DF,  p-value: < 2.2e-16

Let’s give a marginal look at our response variable:

hist(bcast_linear.df$Latency, breaks = 70,
     xlab = "Latency (µs)",
     main = "Histogram of Latency (Linear Broadcast)")

EVIDENT SKEWNESS! Let's try applying a log2 transformation (in the model I will apply it to both Latency and MessageSize):

bcast_linear.df %>% 
  filter(Allocation == "core") %>%
  plot_ly(
    x = ~log2(MessageSize),
    y = ~Processes,
    z = ~log2(Latency)
  ) %>%
  add_markers(
    marker = list(color = ~log2(Latency),
                  line = list(color = "black", width = 0.8))
  ) %>%
  layout(
    # 3D axis titles must go inside `scene` for plotly
    scene = list(xaxis = list(title = "log2(MessageSize)"),
                 yaxis = list(title = "Processes"),
                 zaxis = list(title = "log2(Latency)")),
    title = "log2(Latency) vs log2(MessageSize) and Processes for Linear algorithm (Allocation='core')"
  )

The linear model becomes:

fit.linear.log<- lm(formula= log2(Latency) ~ -1+log2(MessageSize)+Processes,
                data=bcast_linear.df %>% filter(Allocation == "core")
                )
summary(fit.linear.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + log2(MessageSize) + Processes, 
##     data = bcast_linear.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4898 -1.1118 -0.5450  0.3623  3.6810 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## log2(MessageSize) 0.321259   0.009308   34.51  < 2e-16 ***
## Processes         0.029815   0.003808    7.83 2.92e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.62 on 502 degrees of freedom
## Multiple R-squared:  0.8824, Adjusted R-squared:  0.8819 
## F-statistic:  1883 on 2 and 502 DF,  p-value: < 2.2e-16

Comment: \(R^2_{adj}\) increases from 83.7% to 88.2%!
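As a sanity check (a sketch only, using the point estimates reported in the summary above and ignoring the retransformation bias of the log model), the fitted model can be inverted by hand to predict latency on the original µs scale:

```r
# Coefficients copied from summary(fit.linear.log) above
b_msg  <- 0.321259  # slope for log2(MessageSize)
b_proc <- 0.029815  # slope for Processes

predict_latency <- function(msg_size, procs) {
  2^(b_msg * log2(msg_size) + b_proc * procs)  # invert the log2 response
}

predict_latency(1024, 24)  # e.g. a 1 KiB message on a full node, in µs
```

On real data one would of course use `predict(fit.linear.log, newdata = ...)` and back-transform, but writing the inversion out makes the model's structure explicit.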

Model for chain alg

Same considerations as above…

fit.chain.log<- lm(formula= log2(Latency) ~ -1+log2(MessageSize)+Processes,
                data=bcast_chain.df %>% filter(Allocation == "core")
                )
summary(fit.chain.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + log2(MessageSize) + Processes, 
##     data = bcast_chain.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3286 -1.3089 -0.5489  0.6052  3.1535 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## log2(MessageSize) 0.306713   0.009139  33.560  < 2e-16 ***
## Processes         0.021957   0.003739   5.873  7.8e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.591 on 502 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8677 
## F-statistic:  1654 on 2 and 502 DF,  p-value: < 2.2e-16

Model for binarytree alg

fit.binarytree.log<- lm(formula= log2(Latency) ~ -1+log2(MessageSize)+Processes,
                data=bcast_binarytree.df %>% filter(Allocation == "core")
                )
summary(fit.binarytree.log)
## 
## Call:
## lm(formula = log2(Latency) ~ -1 + log2(MessageSize) + Processes, 
##     data = bcast_binarytree.df %>% filter(Allocation == "core"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4247 -1.3128 -0.6351  0.5242  3.6288 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## log2(MessageSize) 0.317449   0.009207  34.480  < 2e-16 ***
## Processes         0.016315   0.003766   4.332 1.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.603 on 502 degrees of freedom
## Multiple R-squared:  0.8663, Adjusted R-squared:  0.8658 
## F-statistic:  1626 on 2 and 502 DF,  p-value: < 2.2e-16
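Finally, putting the three fitted models side by side (the slope values below are copied from the summaries above), the per-process slope shrinks from Linear to Chain to BinaryTree, which matches the intuition that tree-based broadcasts scale better with the number of processes:

```r
# Processes coefficients from the three log2-model summaries above
proc_slope <- c(Linear = 0.029815, Chain = 0.021957, BinaryTree = 0.016315)

sort(proc_slope, decreasing = TRUE)  # Linear > Chain > BinaryTree
```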